Finite-Sample Equivalence of Several Statistical 
Models for Presence-Only Data 



Statistical modeling of presence-only data has attracted much recent 
attention in the ecological literature, leading to a proliferation of meth- 
ods, including the inhomogeneous poisson process (IPP) model [12) . max- 
imum entropy (Maxent) modeling of species distributions [H] [S] [S]i ^nd 
logistic regression models. Several recent articles have shown the close 
relationships between these methods [l] [12]. We explain why the IPP 
intensity function is a more natural object of inference in presence-only 
studies than occurrence probabilities. Further, we explain why the above 
three techniques all essentially implement the IPP model, and prove exact 
finite-sample equivalence between the IPP model, Maxent, and a weighted 
form of logistic regression. That is, given the same data, the same basis 
expansions, and the same regularization penalty — and provided that the 
intercept a is unpenalized — the slope coefficient estimates /3 are identical 
for each method. The exact equivalence of these various models implies 
that every method is equally well-motivated and equally extensible with 
existing software packages. In particular, many methods that extend lo- 
gistic regression can also extend Maxent or the IPP model in exactly 
analogous ways. 

At several points we discuss issues of imperfect detection and observer 
bias, and describe how to use the IPP model to combine presence-only 
and presence-absence records from one or more species to address these 
issues. 

1 Introduction 

A common ecological problem is estimating where a species of interest can be 
found from records of where it has been found in the past. There are many 
motivations for solving this problem: planning wildlife management actions, 
monitoring endangered or invasive species, or simple scientific discovery. 

1.1 Presence-Only Data 

Estimation is simplest and most convincing when the observations of species 
presence are collected in a systematic manner. In a typical design, a surveyor 
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Figure 1. Koala rw ortls (courtesy of New South Wales National Parks & Wildlife Serviec) and the road network on 
part of the New South Wales rrorttr coast. 



Figure 1: Observer bias in presence-only data for koalas. Taken from Margules 
and Austen (1994). 

visits a one-hectare patch of land for one hour and records whether or not she 
discovers any specimens in that interval. The records of unsuccessful surveys 
are called absence records — a mild misnomer since ecologists recognize that 
specimens could be present but go undetected. Data sets of this form are called 
presence-absence data. 

Unfortunately, presence-absence data are often expensive to collect, espe- 
cially for rare or elusive species. For many species of interest, the only data 
available are museum or herbarium records of locations where a specimen was 
found. These presence-only records are often collected haphazardly and fre- 
quently suffer from unknown observer bias such as that illustrated in Figure [l] 
The clustering of koala sightings near roads and cities probably has more to do 
with the behavior of people than of koalas. 

1.2 What Should We Estimate? 

Before we can sensibly decide how to model presence-only data, we must adress 
the issue of what it is we are modeling in the first place. How should we even 
think of "occurrence," the scientific phenomenon nominally under study? This 
issue arises with presence-only and presence-absence data alike. 

1.2.1 Occurrence Probability 

To illustrate this conundrum, we include a typical "heat-map" output (Figure [2]) 
of a study of the willow tit in Switzerland using count data [Hj . The map reveals 
which locations are favored by the species and which are not (in this case, high 
elevation and moderate forest cover appear to be the bird's habitat of choice). 
The legend shows that the color of a region reflects the local probability of 
"occurrence." 

But precisely what event has this probability? Reading the paper, we dis- 
cover that occurrence means that there is at least one willow tit present on a 
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Figure 2: Typical heat map of occurrence probabilities. Taken from Royle et 
al. (2005) 

survey path through a 1 km x 1 km quadrat of land. The authors analyze a 
presence-absence data set using a hierarchical model that explicitly accounts 
for the possibility that a bird was present but not detected at the time of the 
survey. 

Because the survey path length varies across sampling units, the authors use 
it in their model as a predictor of presence probability. It is not specified which 
value of this predictor is used in generating the heat map, which makes the map 
difficult to interpret. 

Even if we could interpret the heat map as the probability of a bird being 
present anywhere in the quadrat (not just along the path), the probability of 
a bird being present would still be larger in a 2 km x 2 km sampling unit 
and smaller in a 100 m x 100 m one. Therefore the very definition of "occur- 
rence probability" in a presence-absence study depends on the specific sampling 
scheme used to collect the presence-absence data. Correspondingly, interpreting 
the legend of such a heat map can only make sense in the context of a specific 
quadrat size (typically, whatever size was used in the study). 

Though the choice to estimate occurrence probability in a specific quadrat 
size is ecologically arbitrary, it can at least yield estimates with a meaningful 
interpretation. By contrast, trying to estimate "occurrence probability" in a 
presence-only study is a far murkier proposition. Any method that estimates 
occurrence probabilities without reference to quadrat size would seem to be 
predicting the same probability of occurrence within a large quadrat or a small 
one, which cannot make sense. 

1.2.2 Occurrence Rate 

Since occurrence probability is only meaningful with reference to a specific 
quadrat size, it is an awkward quantity to model in a presence-only study. 
We feel that it is more natural to estimate an occurrence rate in this context, a 
quantity with units of inverse area (e.g. 1/km^) corresponding to the expected 
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number of specimens per unit area. Under some simple stochastic models for 
species occurrence, including the ones considered here, specifying the occur- 
rence rate is equivalent to specifying occurrence probability simultaneously for 
all quadrat sizes. 

Unfortunately, estimating this rate as an absolute quantity is not practical 
for typical presence-only samples. For instance, suppose our koala data set had 
twice as many sightings. It might mean that there were twice as many koalas, 
or it might only mean that there were the same number of koalas but twice as 
many motorists to see them, or that motorists were twice as likely to call in a 
sighting given that it occurred. 

Since the total number of sightings reflects nothing meaningful about the 
occurrence rate, the only real information at our disposal is the distribution of 
sightings across our domain of interest. If we are lucky, this distribution reflects 
the relative occurrence rate of the species, i.e. a function proportional to the 
occurrence rate. Even trying to estimate relative occurrence rates is susceptible 
to confounding as a result of observer bias, an issue we explore in more detail 
in Section [231 

1.3 Notation 

We now introduce notation we will use for the remainder of the article. We 
begin with some geographic domain of interest V, typically a bounded subset 
of M^. If the time of an observation is an important variable, we might take 
P C M"^, giving a point process in both space and time. 

Assume without loss of generality that the total area of I? is 1. Associated 
with each geographic location z e I? is a vector x(z) of measured features. 

Our presence-only data set consists of ni locations € 2?, i = 1, 2, . . . , ni. 
This data set is accompanied by no "background" observations Zi,i = ni + 
1, . . . , ni -|- riQ, typically a simple random sample from 2?. 

Finally let Xi = x{zi) be the features associated with observation i, and 
Ui = 1i<ni be an indicator of presence/background status. 

Our treatment of these data as random or fixed will vary throughout the 
article. 

1.4 Outline 

The rest of the paper is organized as follows. In Section[2]we define the log-linear 
inhomogeneous poisson process (IPP) model and its application to presence-only 
data, with special focus on interpretations of the model and its score equations. 
Aarts et al (2011) showed that many methods in species modeling can be mo- 
tivated by the IPP model. We explicitly draw these connections for several 
especially illuminating examples. 

In Section|3]we consider a particularly important example, showing that the 
Maxent likelihood of Phillips et al. follows immediately from partially maxi- 
mizing the IPP log-likelihood with respect to the intercept a. It is equivalent 
to the IPP in the sense that both methods produce exactly the same slope esti- 
mates (5 in any finite sample, and the intercept estimate a from an IPP is rarely 
scientifically interesting. 

In Section [4] we discuss logistic regression and its connections to the IPP 
model. Warton and Shepherd (2010) showed that the logistic regression can 
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be derived from the IPP model, and that for any fixed presence sample as the 
number of background samples tends to infinity the fitted logistic regression 
slopes converge to the fitted IPP slopes. 

However, if the log-linear model is misspecified this convergence may not 
occur until the background sample is extremely large, and may not occur at all 
if the number of presence samples grows along with the number of background 
samples. 

We show that by reweighting the samples we can use logistic regression to 
recover the IPP estimate /3 precisely in any finite sample. 

An advantage of having IPP as a unifying model for presence-only data is 
that we can derive a variety of other simple parametric forms for other types of 
data, including presence-absence and count data. In Section [5] we consider how 
we might use this fact to join disparate data sets into one likelihood function, 
either to share information across species, to estimate an overall abundance rate, 
or even to estimate the level of observer bias in presence-only samples. Section 
[6] contains discussion. 



2 The Inhomogeneous Poisson Process Model 

The IPP is a simple model for the distribution of a random set of points Z 
falling in some domain V. Both the number of points and their locations are 
random. 

An IPP can be defined by its intensity function 

X:V — ^[0,oo) (1) 

Informally, A indexes the likelihood that a point falls at or near z. For subsets 
ACV, define 

A(A) = / \{z) dz (2) 

J A 

and assume A(I?) < oo. 

There are two main ways to formally characterize an IPP with intensity A. 
One simple definition is that the total number of points is a Poisson random 
variable with mean A(I?), and their locations are independent and identically 
distributed with density p\{z) — A(z)/A(2?). The only difference between an 
IPP and a simple random sample from p\ is that its size is random. 

Alternatively, we can think of an IPP as a continuous limit of a poisson 
count model in discretized geometric space. Let N{A) — #(Zn A), the number 
of points falling in set A. An equivalent characterization of the IPP model is 
that 

N{A) - Poisson(A(A)) (3) 

with N{A) and N{B) independent for disjoint sets A and B. For more on the 
IPP and other point process models, see e.g. [2]. 

In the case of a finite discrete domain V = {zi,Z2, ■ . . , Zm}, the IPP model 
reduces to a discrete Poisson model, with 

Niz,) ^ Poisson(A(z,)) (4) 
In this sense, the IPP model is a limit of finer and finer discretizations of V. 
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2.1 Modeling Presence-Only Data as an IPP 



Warton and Shepherd (2010) proposed modehng the species sightings zi, . . . , z„^ 
as arising from an IPP sightings process whose intensity is a log-hnear function 
of the features x{z): 

X{z) = e"+'''^(^) (5) 

The hnearity assumption does not impose as many restrictions as it may appear 
to, since our choice of feature vector x{z) could incorporate polynomial terms, 
interactions, a spline basis, or other basis expansions. 

If we adopt the interpretation of an IPP as a simple random sample with 
random size, we see that a and /3 play very different roles. Since a only scales 
X{z) up or down, it has no effect on px = X/A{D). The "slope" parameters 
/3 completely determine px, while a merely scales the intensity up or down to 
attain any desired average sample size A(2?). 



2.2 Maximum Likelihood for the IPP 

Like many exponential family models, the log-linear IPP has simple and en- 
lightening score equations. The log-likelihood is 

£(a, /3) = V a - [ e°'+^''=^^) dz (6) 

Differentiating with respect to a we obtain the score equation 

m = / e"+'^'^(^) dz = A{V) (7) 



Jv 

That is, whatever /3 is, a plays the role of a "normalizing" constant guaranteeing 
that A(z) integrates to ni. This is our first glimpse at why a is typically not of 
scientific interest, since it basically encodes the total number of museum records 
we have. 

Solving for a in Q we obtain the partially maximized log-likelihood 



Rearranging terms and ignoring constants, we have 



P'x, - m (8) 



r (/3) = V - ni log ( f e'^'^f") dz 
y.=i ^-^^ 



(9) 



= J2 ^OgPx{z^) (10) 

the same log-likelihood we would obtain by conditioning on ni and treating Zi 
as a simple random sample from px ■ 

Differentiating ^ with respect to (3 and dividing by ni gives the remaining 
score equations: 

/pe'3'^(^)x(z) dz 
eP'<^) dz 

Kp,x{z) (12) 



yi — 1 *^ ^ 
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This amounts to finding /3 for which the expectation oix{z) under px (z) matches 
the empirical mean of the presence samples. 

Maximizing the likelihood of a log-linear IPP model, then, amounts to 

1. Choosing /3 so p\ matches the means of the features x(z) in the presence 
sample. 

2. Choosing a so that X(z) — nip\{z). 

The first step is really a parametric density estimation problem for the presence 
sample, and the second step doesn't matter if ni has no scientific interpretation. 
Whenever we don't care about ni , the IPP is really nothign more than a density 
estimation procedure (and only a little more complicated if we do). 

Despite our denigration of ni (and therefore a) as a quantity of scientific 
interest, there is one exception we can think of. When several species are under 
consideration, it might be interesting that species 1 was sighted twice as often 
as species 2 — especially if we can obtain an independent estimate of the true 
abundance level of species 2, say through presence-absence data. We expand 
upon this idea in Section [Sj 



2.3 Numerical Evaluation of the Integral 

The IPP likelihood and score equations involve integrals that, in general, we 
cannot evaluate analytically. However, we can use the background samples 
to evaluate it via Monte Carlo. Since dz = 1, we can approximate any 
/(z) dz by an average '^iLi fi^i) over a uniform random sample z^. If 
our background points comprise such a sample, we can replacing the original 
log-likelihood with 

yi=i ^ y.=o 

Two other options for numerically evaluating the integral are to choose back- 
ground points in a regular fine grid of T), or to assign quadrature weights to the 
background points and approximate the integral with a weighted sum. In the 
first case, the optimization criterion would be the same, and in the second the 
only difference would be that the second sum would be a weighted sum over the 
background points. 

Repeating the previous derivation gives the numerical version of the score 
equations 



. i- J: e-^'- (14) 



no „ 

Vi=0 



Throughout, we will refer to ( 13 ) as the numerical IPP log-likelihood to dis- 
tinguish it from the true IPP log-likelihood (|6|. In practice, "fitting" the IPP 



model model would mean fitting (13 1. 
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2.4 Connection to Poisson Log-Linear Model 



Suppose that x{z) is a continuous function on V. If we use background points 
from a regular fine grid, we are essentially discretizing 2? into no pixels Ai, 
each of approximately the same size, and centered at Zi. If we use the 
approximation 

A(^,) = / e"+'^'^(^) dz (16) 
« \A,\e"+P'-^ (17) 

no 

then the IPP model implies that the counts in each neighborhood Ai are gen- 
erated via the Poisson LLM 

N{A,) - Poisson (^^^e^+z^'^'^ (19) 

The log-likelihood of this model is (up to an additive constant) 

£{a, 13) N{A,){a + p'x,) - — V e"+'^'^- (20) 



— no ^ — ' 



Since a;(z) is continuous, 



N{A,){oi^0x.-:)=yyoi^0x, (21) 

P'xk (22) 
a + ^'xk (23) 



Zk<^Ai 



Therefore, ( 13 1 is almost exactly the same as the Poisson LLM log-likelihood for 



this discretized model. The only difference between the two is that in (20) we 
have also discretized the location of each presence sample to match its nearest 
background sample. 

We could indeed fit an IPP model in exactly this way, by simply deleting 
the features of the presence samples and recording only how many fall into each 
background sample's surrounding pixel. Approximating the model in this way 
provides a simple way of accessing the generalizability of GLM methods (see 
Section |4]). 

As we have seen, this is essentially akin to replacing Xi for each presence 
point with the Xi of its nearest background sample. As we will see in Section 
|4.2[ we can do essentially the same thing without making this approximation if 
we use a weighted GLM method. 



2.5 Identifiability and Observer Bias 

Observer bias poses one of the most serious challenges to valid inference in 
presence-only studies. Scientifically, we are interested in the occurrence process 



8 



consisting of all specimens of the species of interest. However, our data set 
comes from the observation process consisting only of the occurrences observed 
and reported by people. 

Assume that each occurrence is observed with probability s(z), which may 
depend on features of the geographic location z. If observation is independent 
across occurrences, then the observation process is an IPP with intensity 

Xohs{z) = Xocc{z)s{z) (24) 

A presence-only data set only directly reflects Aobs- Absent any assumptions 
about s the underlying intensity Aocc is not identifiable. 

One (optimistic) assumption we could make about s is that it is an unknown 
constant. In that case, by estimating Xohs{z) we are also estimating Aocc(-2) up 
to an unknown constant of proportionality s. Even in this optimistic scenario 
we can only estimate relative occurrence intensities, not absolute intensities. 

A bit more realistic is the assumption that s is an unknown function of z, 
but that s and Aocc are known to depend on z through two disjoint feature sets. 
For instance, we could model Aqcc and s as log-linear in features Xi{z) and X2{z) 
respectively 

Aobs(z) = Xocc{z)s{z) (25) 

Then the observation process follows the log-linear model Aobs — e"^^ ^'^^ with 
X = (j^j) and /3 = (^^) . Note that a and /3 are the quantities of primary scientific 
interest, whereas a and /3 are the parameters governing the process we actually 
observe. Nevertheless, (3 = (3i is still identifiable from the data because /3 is. [J 
As nQ,ni oo, our estimate /3i converges to the true value of /3, the slope 
coefficients of Aocc- However, a will converge not to a but rather to a -I- 7. 
Without knowing 7 we have no way of estimating a. Similarly, if xi and X2 
had overlapping (or linearly dependent) coordinates we could not estimate /3 for 
those coordinates. 

To be more concrete, suppose koala occurrence is known to depend only on 
elevation, and that observer bias is known to depend only on proximity to roads. 
Then, despite the obvious observer bias in Figure [T] we could still estimate what 
elevation koalas prefer. By contrast, we could never estimate from this data 
whether koalas tend to avoid roads. 

Even in the most optimistic scenario, we can estimate a = a + j but it 
provides no real information about a. Indeed, we have seen that the only role 
a plays in estimation to make A integrate to ni. 

The distinction between /3 and /3 is important, but for most of this paper 
we will focus on estimation of /3, the slope parameters of the process we get to 
observe. We revisit this distinction in Section [5] 

^As with any regression adjustment scheme, we should proceed with caution here. If our 
Hnear model is misspecified (perhaps we should have included x|) and xi is correlated with the 
missing variables, even regression adjustment will not remove all bias. In perverse situations 
it could even make the situation worse. Of course, this must be weighed against the fact that 
if there is observer bias, not accounting for it at all gives biased estimates too. See Section[5] 
for another option for dealing with observer bias. 
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2.6 Geographic Space and Feature Space 

In the context of logistic regression, it will be more natural to think of the Xi 
being a sample of points in "feature space" (i.e. the range of x) rather than as 
the features corresponding to a sample in the geographic domain V. There is 
no real distinction between these two viewpoints, so long as we adjust for the 
fact that some values of x are more common in V than others. 

Suppose the Zi for presence samples (j/i = 1) arise from an IPP with intensity 
A(x(z)). We will show that the corresponding Xi are then an IPP with intensity 
^x(a;) = X{x)h{x), where 

h{x) = I dz (27) 

J {^z: x{z)—x} 

Suppose x were discrete. Then h{x) would be the total area of land with features 
equal to x, and if a presence sample were taken uniformly at random from V 

{j3 = 0) its probability of having features x would be proportional to that area. 
For continuous x, h{x) is the marginal density of x in our domain V, but the 
same intuition applies. 

Suppose B is some subset of feature space, and consider the number Nx{B) 
of Xi falling in the set B. This is the same as the number of Zi falling in the 
inverse image A = x~^{B) = {z : x{z) e B}. That is, 

N^{B) = N{A) ~ Poisson (A(A)) (28) 

But 

A{A) = [ e^(^(^)) dz (29) 

J A 

e^'^'^Uzdx (30) 

= [ e^(^)/i(a;)rfa; (31) 
Jb 

Furthermore, if Bi and B2 are disjoint sets, then so are Ai = x'^^{Bi) and 
A2 = x-'^{B2). It follows that N^{Bi) = N{Ai) and N^iB2) = iV(^2) are 
independent, so the Xi satisfy our second definition of an IPP. 

In terms of our "random sample" view of an IPP, the above derivation im- 
plies that Xx integrates over the whole of feature space to A(r'), and the Xi 
corresponding to = 1 are distributed with density e"~^^ ^h{x)/ K{T>). 

3 Maximum Entropy / Conditional IPP 

Another popular approach to modeling presence-only data is the Maxent method, 
proposed by Phillips et al. (2004). The authors begin by assuming that the 
presence samples z\,...,Zni are a simple random sample from some probability 
distribution p{z). 

The authors adopt the view, inspired by information theory, that the es- 
timate p should have large entropy II{p) = — f^p{z)log{p{z)) dz, while also 
matching certain moments of the sample. Intuitively, the goal is for the estimate 
to be as "close to uniform" as possible, while still satisfying certain constraints 
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that make it resemble the empirical distribution. Indeed, if we maximized en- 
tropy with no constraints, we would estimate p as the uniform distribution over 
V. 

They propose to choose the p which maximizes H{p) subject to the constraint 
that the expectation of the features x{z) under p matches the sample means of 
those features, i.e. 

— Xi = [ x(z)p(z] dz ~ E„a;(z) (32) 

Phillips et al. (2004) show that this criterion is equivalent to maximizing the 
likelihood of the parametric model 

This is exactly the parametric form of p\ for our log-linear IPP, and its log 
is exactly the partially maximized log-likelihood The likelihood ([3| is 

simply the likelihood of a simple random sample from p\^ i.e. a conditional 



IPP. Indeed, the constraint (32) is nothing more than the score criterion for (3 
in an IPP. This result may be found in Appendix A of Aarts et al. (2011). 

The popular software package Maxent implements a method slightly more 
complex than the one originally proposed in 2004. First, it automatically gener- 
ates a large basis expansion of the original features into many derived features 
(quadratic terms, interactions, step functions, and hinge functions of the orig- 
inal features). Then, it fits a model by optimizing an ^i-regularized version of 
the conditional IPP likelihood: 



mlog / e^'<'Uz]-Y,rm (34) 

The regularization parameters rj are chosen automatically according to rules 
based on an empirical study of numerous presence-only data sets . 

Mathematically, the basis expansion only increases the length of the feature 
vector x{z). Moreover, the regularization scheme does not constitute an 
essential difference with the other methods considered here. One could (and 
often should) regularize the parameters of a fitted IPP process as well, especially 
if x{z) contains many features resulting from a large basis expansion. 

Applying a penalty J(/3) to the Maxent log-likelihood does not change the 
equivalence between the two models. Indeed, if we add a penalty term J(/3) to 
the IPP log-likelihood ([6|, we still obtain ([7| after differentiating with respect 
to a. But then, when we partially maximize -^(a, /3) — J(/3) we simply obtain 
£*(/3) — J(/3), the penalized Maxent log-likelihood. Note that this equivalence 
depends on our not penalizing a. 

This argument generalizes immediately to a generic penalized likelihood 
method with a parametric form for logA(z). We have established the following 
general proposition: 

•^In the notation of the Maxent papers the identities of A and f) are interchanged relative 
to the notation in this article. 
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Proposition 1. Given some parametric family of real-valued functions {fg : 
6 G W} with penalty function J{0), consider the penalized negative log-likelihood 
for an IPP with intensity e^+Z''^'^^^)) 



^ Jv 



(35) 



Vi = l 



and the penalized negative log-likelihood for a conditional IPP with density pro- 
portional to e^"''^^'^^'' 



92{0) 



V feix^) + ni log ( [ e"+^«(^(^» dz] + J{e) 



(36) 



Then (35) and (36) are equivalent in the sense if {a, 9) minimizes gi, mini- 
mizes (72; OLi^d if 6 minimizes g2, there exists a unique a for which {a, 9) mini- 
mizes gi. 

The same applies if we replace the integrals in (35) and (36) with sums over 
the background sample. 



Proof. Partially optimize 171 over a as in ([8|) to obtain 32 • 



□ 



Thus we see that, while Maxent and the IPP appear to be different models 
with different motivations, they fit the exact same density p\. Fundamentally, 
this is a consequence of what we observed in Section |2.2[ Maxent solves the 
same density estimation problem as step 1 of the IPP-fitting procedure, then 
skips step 2. 



4 Logistic Regression 

Another ostensibly different model for presence-only data is the so-called "naive" 
logistic regression. This approach treats presence-only modeling as a problem 
of classifying points as presence [y = 1) or background {y — 0) on the basis of 
their features. The logistic regression model treats ni, no, and the Xi as fixed 
and the yi as random with 

Superficially this approach may appear ad hoc and scientifically unmotivated 
compared to IPP or Maxent. Weighed against this concern is the fact that 
logistic regression is an extremely mature method in statistics, enjoying myr- 
iad well-understood and already-implemented extensions such as GAM, MARS, 
LASSO, boosted regression trees, and more. 

Logistic regression modeling of presence-only data has often been motivated 
by analogy to logistic regression for presence-absence data. Since it is not known 
whether the species is present at or near the background examples, these are 
sometimes referred to as "pseudo-absences," and the supposed naivete of the 
method refers to the fact that it treats them as actual absences. Various authors 
have introduced latent variables coding "true" presence or absence and have 
tried to fit this model via the EM algorithm or hierarchical methods [TT], [TUj. 
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This interpretation raises once again the troublesome question of what it 
would actually mean for one of our randomly sampled background points to be 
a "true absence" or "true presence." Would there need to be a specimen sitting 
directly on the location? Or is it enough for it to be within 100 m? 1 km? In 
fact, if we adopt the first view it would seem that our worries are over since 
we can assume there is a near-zero probability that a random geographic point 
coincides with the species of interest. 

Fortunately we can sidestep these concerns entirely, since there are deep 
connections between the logistic regression and IPP models which yield a more 
straightforward interpretation. Warton & Shepherd (2010) showed that when 
the IPP model holds, so does the logistic regression model, and that if rii is 
held fixed and ng — ?> oo, the difference between the fitted $ for the two models 
converges to 0. 

We will show that the log-linear IPP model implies that P{yi\xi) is exactly as 
in (37), with the same slope parameters (3. However, if this model is misspecified 
and Til grows along with no, the two models' estimates for /? generally do not 
converge to the same limit. The limiting parameters of the logistic regression in 
general will depend on the limiting ratio of no to ni (Fithian and Hastie, 2012). 

Nevertheless, by using a weighted form of logistic regression we will see that 
we can exactly recover the IPP estimate for /3 in finite samples. This implies in 
particular that packages implementing extensions of weighted logistic regression 
can be used to fit analogous extensions of the IPP model. 



4.1 Case-Control Sampling 

If we begin with a log-linear IPP model and condition on the size of our presence 
and background samples, the mixture of two simple random samples, 

one from the density e"^^ ^h{x)/A{'D) and the other with density h{x). It 
follows that 



1 + e''+/5'^ 



(39) 



ith e'' = 



This logic is depicted in Figure jsj 



noACD) ■ _ 

Thus, the log-linear IPP model implies the individual yi\xi follow a logistic 
regression model with the same slope parameters /3. 

Thus, given any finite sample of presence and background points, we could 
either maximize the numerical IPP likelihood or the logistic regression likeli- 
hood, and in either case we would be fitting the same model. This fact alone 
does not guarantee we will obtain the same estimates (3 in any given finite sam- 
ple, but if the log-linear model is correctly specified then maximizing either 
likelihood gives a consistent estimator of the true /?. 

However, when the log-linear model is misspecified, the fitted slopes for lo- 
gistic regression and numerical IPP will in general not converge to the same 



■^The j/i are technically not independent given no and ni (if we knew the other ni -\-no — 1 
labels, we would know the last as well). This is always true in case-control studies, but it is 
typically ignored since the dependence is weak for large samples. 
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Figure 3: Presence-only sampling as case-control logistic regression. 



limiting /? if rii and Uq grow large together. In fact, the limiting logistic regres- 
sion parameters depend on the limiting ratio of ni/no- 

The reason is that when a model is mispecified (as are most parametric 
models, such as the linear model), then what we are estimating is the best 
parametric (linear) approximation in the population to the true function. But 
when we change the mix ui/hq, we are in effect changing the true population, 
and hence the parametric (linear) approximation (Fithian and Hastie, 2012). 

If we modify the logistic regression procedure a bit, we need not wait for 
an infinite number of background samples. We can recover the same /3 that we 
would estimate with a IPP using the same no background points. 

4.2 Weighted Logistic Regression 

Typically, although we use a finite background sample, we actually have an 
infinite (or at least much larger) reservoir of possible background points we 
could have used. Suppose we view each background point in our sample as 
a representative of many more background points which we only excluded for 
the purpose of computational convenience. To reflect this wc might assign case 
weights to the samples 

Wi = <, , K ■ (40) 
[ 1 otherwise 

for some large number W . We would then obtain the weighted log-likelihood 
function 

%lr(?7, P)^Y.'"^ + ^'^') - log (l + 6"+^'"')] (41) 

i 

= + (3'x^ -Wj2^og{l + e''+^'^') - ^ log (l + e''+'3'^') 

yi=i yi=o yi=i 

(42) 
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If a unique MLE (aipp,/3ipp) exists for the IPP model, and {aw,$w) solve 
(41 1 for weighting factor W, we have 

lim ^w=PiPP (43) 

W—¥CO 

We prove a more general version of this fact, again allowing the possibility of a 
general penalized likelihood approach. 

Proposition 2. Consider a family {fg : 9 € W} where 9 i-> .fe{x) is concave 
and differentiable, with convex penalty Junction J{9). 

Suppose the penalized numerical negative log-likelihood for an IPP with in- 
tensity e°+/e(^(^)) 

g,{a,9) = -J2a + fe{x.) + -Y.e'^+M^^^+J{9) (44) 
y^=^ ^ yi=o 

has a unique minimizer (dipp, ^ipp). Also, define the penalized weighted logistic 
regression log-likelihood 

gwiv, + + E log (l + e"+^'^"'^) (45) 



W log (l + 6"+^" + J (9) 



y,=o 

Then if (f/WkT^^Wk) '^''^V sequence of minimizers of g^r^, with Wk — > oo, we 
also have 

Owk ^ ^"iPP (46) 

Proof. Define 

gw{a,9) = gw{a-\ogWnQ,9) - uilogWu^ (47) 
= - E " + /«(^') + wY,^og(l + ^e-^+M^A (48) 







a shifted version of gw- Since {rjWk + ^ogWnQ,9wk) rninimize gwki it suffices 
to show that any sequence of minimizers of gwk ^-Iso satisfy ( 46 1 . 

In particular, we will show that as — > oo, gw — ^ gi uniformly on compact 
subsets of Since gw and gi are convex and the minimizer of gi is unique, 

uniform convergence in any neighborhood of (Q!ipp,0ipp) is sufficient to prove 
the claim. 

The difference between the two criteria is 
~gw{a,9)-gi{a,9)=Y, \og(^ + ^^e'^+S^'^^-''^ - ^e+M^^^^ (49) 
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For any compact Q C Rp+i, we have 

sup a + fe{xi) ~ B < oo (50) 

l<z<no+ni 

(Q,e)ee 

so that vP^^""'''''''^^''' tends uniformly to for all values of a, 9 and Xi under 
consideration. 

Using the Taylor expansion log(l + u) = u + 0{u^), we obtain 

sup \giia,e)-gwia,e)\^0{W'') (51) 
(Q,0)ee 

proving the result. □ 

The above proof goes through with virtually no modification if we substi- 
tute for the logistic regression log-likelihood the poisson log-linear model log- 
likelihood: 



tWLLM 



{r,, P) = Y.w. [y,{r, + (i'x,) - (52) 



Proposition 3. Under the same conditions as Proposition^ if instead 

yi=o 



then (4-6) holds as before for any sequence of minimizers of gy/^. with Wk — > oo. 
Proof. As before, define 

gw{a,e) ^ gw{a-\ogWno,9) - nihgWno (55) 

(56) 

Then 



(57) 

which tends uniformly to zero on compact subsets of MP+^. The rest of the 
proof is the same. □ 

These two results also imply that logistic regression and poisson regres- 
sion converge to each other when we upweight the negative examples. This 
phenomenon has a simple heuristic explanation. As we upweight the nega- 
tive examples we drive all the fitted means toward zero, by driving fj to —oo. 
There is hardly any difference between a Poisson(e^) random variable and a 

Bernoulli ^j^^^ random variable for very negative A, and for that reason there 
is hardly any difference between the two GLMs. 
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4.3 Logistic Regression as Density Estimation 

One interpretation of the results we have just reviewed is that in the context 
of presence-only data, logistic regression estimates the same density estimation 
problem as Maxent and the IPP do. Moreover, weighted logistic regression even 
finds the exact same estimates as the numerical IPP and Maxent procedures do. 

Using logistic regression for density estimation is not without precedent. It 
was previously discussed in Section 14.2.5 of Hastie et al. (2009) as a means for 
turning the unsupervised problem of density estimation into a well-understood 
supervised classification problem of samples against background. The specific 
proposal in that book chooses a different weighting scheme (assigning half the 
total weight to each sample), which does not coincide exactly with IPP. 

We believe that viewing logistic regression as a density estimation procedure 
resolves many of the conceptual misunderstandings that originally led to its 
labeling as "naive." 

4.4 Simulation Study: Weighted vs Unweighted Logistic 
Regression 

We have seen that both weighted logistic regression (a.k.a. numerical IPP) 
and unweighted logistic regression estimate the same /? parameter of the same 
IPP model, and when the background sample is much larger than the presence 
sample, the estimates $ are close to each other. 

However, the weighted logistic regression estimate can converge much faster 
to the large-background-sample limit if the linear model is misspecified. We 
illustrate this phenomenon with a simulation study. We first generate ni — 
1000 positive examples from the normal distribution with mean 1 and standard 
deviation 2. Fixing the presence sample, we then repeatedly generate random 
background samples of various sizes uq. The negative Xi are generated from the 
normal distribution with mean and standard deviation 1. 

For each background sample, we fit both a weighted {W = 1000) and un- 
weighted logistic regression to the combination of presence and background 
points. There are twenty replicates per value of no, and the results are shown 
in Figure |4] For relatively large sizes of background sample, there is very little 
sampling variability, but the logistic regression estimates carry a large bias that 
depends greatly on the size of the background sample. The limiting /3, to which 
both methods would converge given an infinite background sample, is depicted 
with a horizontal line. 

Since the choice of background sample size is primarily a matter of con- 
venience, it is preferable to use an estimator that depends on it as little as 
possible. When the linear model is misspecified (which is nearly always the 
case), we recommend the weighted logistic regression / IPP for this reason. 

5 Pooling Data from Multiple Sources 

We have seen that the IPP model is a unifying framework for understanding 
several popular approaches to modeling presence-only data. Its simple form 
induces familiar parametric forms for other quantities we might model as well, 
for example presence-absence and species count data. 
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Weighted and Unweighted Estimates for Logistic Regression 
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Figure 4: Unweighted logistic regression may require a very large background 
sample before convergence 
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If a team of surveyors exhaustively survey plots of land Ai with areas \Ai\ 
and covariates Xi, then the number Ni of specimens they encounter will be 
distributed poisson with mean A{Ai) — \Ai\e"~^^ Recall that a and f3 are 
the parameters of the underlying occurrence process — Aqcc in ( 25 ) . This is a 
poisson LLM with offsets log |^ 

If instead we only record whether or not at least one specimen was encoun- 
tered, then we have a bernoulli GLM with complementary log-log link and offset 
1^,1 

P(A^, > 0\x,) = 1 - e-l^'l^"'"'''' (59) 

The fact that one model yields coherent likelihoods for all these sampling schemes 
means we can pool together presence-only, presence-absence, and count data 
into a single log-likelihood. 

Why might we want to do this? First, suppose we make the assumption of 
no selection bias, i.e. (3 — (5. Then if we had access to a small presence- absence 
data set and a large presence-only data set, the presence-only data would help 
us to pin down /3 allowing the presence-absence data to more efficiently estimate 
a. 

In practice, it may be too much to assume the surveyors see every animal, 
but we could assume that at least there is no dependence on z, and surveyors see 
each animal independently with equal probability. In this case, the intercept for 
presence-absence data is not a but a — e, where e > 0. Presence-only sampling 
will not tell us the overall level of abundance but it might still give a useful 
lower bound if e is not too large. 



5.1 Estimating Sampling Bias 

Suppose that we are not willing to assume away sampling bias, but we do believe 
that it is the same for several species. For instance, proximity to roads and cities 
may introduce a comparable observer bias on similar species. 

With this assumption as motivation, Phillips et al. (2009) proposed using 
other species' sightings as background observations instead of randomly sam- 
pled locations. This method, called the "target-group background" (henceforth 
TGB) method by the authors, effectively "controls away" any observer bias that 
affects all species equally. 

Unfortunately, the TGB approach also controls away any real environmental 
factor that affects overall abundance. If half of our environment is a lush rain- 
forest and the other half is a desert wasteland, this method implicitly assumes 
that the overall abundance in each region is the same, and the only reason we 
have fewer samples from the desert is observer bias. 

If we have several similarly-biased species with comparable observer bias, we 
might use the model 

Aocc,j(^) = e"^+^>(-) (60) 
Aobs,,(^) = e"^+-'+(ft+*)'^(^) (61) 

where j indexes species. 

All the parameters of this model are identifiable once we have 

^We should proceed with caution in modehng count data as Poisson, since the actual counts 
arc likely to be ovcrdispersed. 
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1. Presence-only data from every species. 



2. Presence-absence or count data for a single species (say, j = 1). 

The presence-only data lets us independently estimate aj and /3j for each 
species, while the presence-absence or count data lets us estimate ai and /3i. 
This is enough to estimate the other ctj and /3j because 

= - 7 (62) 
— aj + ai — ai (63) 

/3, = /3, - <S (64) 

= /3, /3i - /3i (65) 

If there is imperfect but unbiased detection in the presence-absence study, then 
as before we can only estimate aj — s and not aj. However, this would not 
affect identifiability for f3j. 

Depending on the species involved and the amount of data available, we 
may also benefit by shrinking estimates of the aj toward each other or sharing 
information in some other way. 



5.2 Simulation Study 

In this section we simulate a simple fictional landscape with ten species, two 
environmental features, and a third feature which induces observer bias. 

Our geographic domain is the unit square [0, 1]^, and the two environmental 
variables are identified with the geographic axes. For simplicity we have dis- 
cretized the geographic space into pixels, so there is no distinction between the 
LLM and the IPP. 

The first variable, "wasteland," is negatively associated with all ten species 
by varying amounts, and the second, "elevation," is positively associated with 
some species and negatively for others. The particular /3j for each species is 
randomly generated. 

The region also contains four "towns" in which human population, the third 
variable, is large. The human population does not affect the species, but it 
does have a multiplicative effect on the rate at which sightings occur. Figure [5] 
displays the relative observance and occurrence intensities for two species. 

Our data consist of presence-only data sets for each species, each with 300 
points in expectation. First we simply fit an IPP to each species using the other 
species' observations as a background sample, following the TGB proposal. This 
method essentially looks for contrasts between the observation intensities of the 
different species. Its estimates are listed in Table [l] under the heading TGB. 

It succeeds in removing the observer bias from the estimated intensities, but 
it additionally removes the average effect of the wasteland on all species, produc- 
ing severely biased estimates. Given the data we have and no other information 
or assumptions, there is fundamentally no way to distinguish between observer 
bias and a legitimate environmental effect common to the different species. 

If we knew ahead of time that the human population variable only affects s{z) 
and the environmental variables only affect Aobsi we could use the regression- 
adjustment approach of Section [2. 5| for each species. Because that assumption is 
correct in this case, regression adjustment successfully adjusts for the observer 
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Species 1, Occurrence Rate 



Species 1 , Observation Rate 




.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1. 

wasteland wasteland 



Species 3, Occurrence Rate Species 3, Observation Rate 




.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1. 

wasteland wasteland 



Figure 5: Aocc and Aobs for two species. Red is high, blue is low. 
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bias and not for the true environmental effect. The estimates from this method 
are in Table [T] under the heading RA (for regression adjustment). 

If we were not willing to make this assumption (for instance, a priori it could 
be that human population has a real effect on the species' occurrence process), 
we could still account correctly for observer bias by fitting the model defined 
by ( [eOpTj ). This model requires additional data: unbiased presence-absence or 
count data for at least one species. Here we independently simulate count data 
for species 1, "collected" from 300 randomly chosen sites. Since the distribution 
of the counts depends on sampling area \Ai\ and detection probability e, we 
choose these implicitly so that their means are proportional to e^^^^ (equal 
sampling area) and the average site has expected count of 2. 

Because we have discretized geographic space, the presence-only data can 
also be represented as counts over the pixels. This implies we can maximize 
likelihood for ( 60]|6T ) by simply fitting a single poisson LLM to the concatenated 
records of the 11 (10 presence-only and 1 count) data sets, using the R function 
glm. The estimates from this method are in Table [T] under the heading DP (for 
data pooling). 



Log Human Pop. 



Wasteland 



Elevation 



Species 


True 


TGB 


RA 


DP 


True 


TGB 


RA 


DP 


True 


TGB 


RA 


DP 


1 


0.00 


0.01 


0.00 


-0.08 


-1.55 


0.93 


-1.23 


-1.52 


0.42 


0.40 


0.43 


0.33 


2 


0.00 


-0.17 


0.00 


-0.24 


-2.09 


-0.07 


-2.16 


-2.45 


0.98 


0.87 


0.89 


0.79 


3 


0.00 


0.05 


0.00 


-0.07 


-2.79 


-0.72 


-2.67 


-2.96 


-0.39 


-0.44 


-0.29 


-0.38 


4 


0.00 


-0.18 


0.00 


-0.26 


-1.43 


0.77 


-1.31 


-1.60 


-1.04 


-1.15 


-1.03 


-1.13 


5 


0.00 


0.33 


0.00 


0.22 


-1.96 


0.15 


-1.98 


-2.27 


1.78 


1.52 


1.51 


1.42 


6 


0.00 


-0.07 


0.00 


-0.15 


-2.07 


0.00 


-1.92 


-2.21 


-2.31 


-3.19 


-2.90 


-3.00 


7 


0.00 


0.24 


0.00 


0.13 


-2.35 


-0.31 


-2.34 


-2.63 


0.88 


0.38 


0.46 


0.36 


8 


0.00 


-0.01 


0.00 


-0.10 


-1.88 


-0.05 


-2.11 


-2.40 


0.04 


0.24 


0.32 


0.22 


9 


0.00 


0.05 


0.00 


-0.06 


-2.99 


-1.12 


-3.10 


-3.39 


1.01 


0.85 


0.90 


0.81 


10 


0.00 


0.01 


0.00 


-0.08 


-1.93 


0.09 


-1.99 


-2.28 


0.43 


0.43 


0.50 


0.41 



Table 1: Results of three methods for correcting observer bias. The target-group 
background method succeeds in controlling for the Human Population variable, 
but in so doing controls away the Wasteland variable. 



6 Discussion 

We have shown here that the IPP, Maxent, and weighted logistic regression are 
equivalent in several senses: 

1. All may be derived from the IPP model. 

2. All may be thought of as performing the same parametric density estima- 
tion problem, which amounts to fitting /3. 

3. All fit the same /3 given the same finite presence and background samples. 

The only difference between the IPP and the other two methods is that it fits 
an intensity equal to ni times the fitted density. 

These findings remain the same if we replace the f3'xi term in the exponent 
with a smooth parametric model that is convex in its parameters 9, or apply a 
convex penalty to /3. 
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Logistic regression is one of the most widely applied methods in statistics. 
For decades, applied statisticians have been developing, studying, and using 
variations on logistic regression to solve classification problems in statistics. R 
packages exist for fitting generalized additive models (GAMs), boosted regres- 
sion trees, MARS, and every manner of tailored regularization schemes (see, 
e.g., 0). 

All of these methods are well-understood within the context of logistic re- 
gression, and we believe that the most important practical implication of the 
equivalence between the IPP model and weighted logistic regression is that all 
of these methods can now be equally well-understood within the context of the 
IPP model. 

For instance, we can fit an IPP / Maxent version of boosted regression trees 
with the following single line of R: 

boosted. ipp <- gbm(y~., f amily="bernoulli" , data=dat, weights=lE3~ (1-y) ) 

For an IPP / Maxent version of LASSO, ridge, or the elastic net: 

lasso. ipp <- glmnetCdat .X, dat.y, f ajnily="binomial" , weights=lE3~ (1-y) ) 

For an IPP GAM: 

gam. ipp <- gam(y~s(xl)+s(x2)+x3*x4, f amily=binomial , weights=lE3~ (1-y) ) 

Finally, the same IPP modeling framework induces likelihoods for other 
forms of data which may not have the same biases as presence-only data sets, 
which may then be combined with the presence-only data into a single modeling 
function. We have demonstrated one technique for using this data to correct 
for observer bias. 

This added flexibility promises to provide a powerful tool to modelers of 
presence-only data. 
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